Breast cancer classification is a critical task in medical data science. The goal is to build a model that accurately predicts whether a tumor is benign or malignant from a set of measured features. This project applies and compares several machine learning classifiers, with particular attention to Support Vector Machines (SVM).
Objective:
Classification Task: Use SVM to predict whether a breast cancer diagnosis is malignant or benign.
Target Variable: Diagnosis classification (malignant or benign).
Evaluation Metrics: Accuracy and Area Under the ROC Curve (AUC), estimated with K-Fold Cross-Validation.
The data used in this project come from the Breast Cancer Wisconsin (Diagnostic) Data Set.
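The notebook below reads the CSV from a Google Drive path. For readers without that file, an equivalent copy of the Breast Cancer Wisconsin (Diagnostic) dataset ships with scikit-learn; note that its column names differ slightly, it has no 'id' or 'Unnamed: 32' columns, and its target encoding is the reverse of the one used later in this notebook. A minimal sketch:

```python
# Optional stand-in: load scikit-learn's bundled copy of the Breast
# Cancer Wisconsin (Diagnostic) dataset instead of the Drive CSV.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df_alt = data.frame  # 30 numeric features plus a 'target' column

# Caution: here target 0 = malignant and 1 = benign (see data.target_names),
# the opposite of the 1 = malignant mapping applied later in this notebook.
print(df_alt.shape)  # (569, 31)
```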
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import pandas as pd
import numpy as np
df = pd.read_csv('/content/drive/MyDrive/2023/data.csv')
df.shape
(569, 33)
The dataset used in this analysis contains 569 rows and 33 columns. Each row represents an individual observation, while the columns correspond to various features or attributes of the data. The first five data observations are shown below:
df.head()
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Examining the data types shows that the 'id' column is an integer, the target variable 'diagnosis' is of type object (categorical), and all measurement features are floats. Since the purpose of this project is classification, the target variable will be crucial for building the model. The 'id' feature merely identifies each observation, and 'Unnamed: 32' contains no values at all (0 non-null entries), so both columns can be dropped before further analysis.
df = df.drop(['id', 'Unnamed: 32'], axis = 1)
df.isnull().sum()
| Column | Missing values |
|---|---|
| diagnosis | 0 |
| radius_mean | 0 |
| texture_mean | 0 |
| perimeter_mean | 0 |
| area_mean | 0 |
| smoothness_mean | 0 |
| compactness_mean | 0 |
| concavity_mean | 0 |
| concave points_mean | 0 |
| symmetry_mean | 0 |
| fractal_dimension_mean | 0 |
| radius_se | 0 |
| texture_se | 0 |
| perimeter_se | 0 |
| area_se | 0 |
| smoothness_se | 0 |
| compactness_se | 0 |
| concavity_se | 0 |
| concave points_se | 0 |
| symmetry_se | 0 |
| fractal_dimension_se | 0 |
| radius_worst | 0 |
| texture_worst | 0 |
| perimeter_worst | 0 |
| area_worst | 0 |
| smoothness_worst | 0 |
| compactness_worst | 0 |
| concavity_worst | 0 |
| concave points_worst | 0 |
| symmetry_worst | 0 |
| fractal_dimension_worst | 0 |
After examining the dataset for missing values, it was confirmed that there are none present in any of the features. This means that the data is complete and can be used for analysis without requiring any handling of missing data.
To gain a better understanding of the dataset, summary statistics for the numerical features are generated below. These statistics include measures such as the mean, standard deviation, minimum, and maximum values, as well as the quartiles, providing insight into the distribution and range of the data.
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| radius_mean | 569.0 | 14.127292 | 3.524049 | 6.981000 | 11.700000 | 13.370000 | 15.780000 | 28.11000 |
| texture_mean | 569.0 | 19.289649 | 4.301036 | 9.710000 | 16.170000 | 18.840000 | 21.800000 | 39.28000 |
| perimeter_mean | 569.0 | 91.969033 | 24.298981 | 43.790000 | 75.170000 | 86.240000 | 104.100000 | 188.50000 |
| area_mean | 569.0 | 654.889104 | 351.914129 | 143.500000 | 420.300000 | 551.100000 | 782.700000 | 2501.00000 |
| smoothness_mean | 569.0 | 0.096360 | 0.014064 | 0.052630 | 0.086370 | 0.095870 | 0.105300 | 0.16340 |
| compactness_mean | 569.0 | 0.104341 | 0.052813 | 0.019380 | 0.064920 | 0.092630 | 0.130400 | 0.34540 |
| concavity_mean | 569.0 | 0.088799 | 0.079720 | 0.000000 | 0.029560 | 0.061540 | 0.130700 | 0.42680 |
| concave points_mean | 569.0 | 0.048919 | 0.038803 | 0.000000 | 0.020310 | 0.033500 | 0.074000 | 0.20120 |
| symmetry_mean | 569.0 | 0.181162 | 0.027414 | 0.106000 | 0.161900 | 0.179200 | 0.195700 | 0.30400 |
| fractal_dimension_mean | 569.0 | 0.062798 | 0.007060 | 0.049960 | 0.057700 | 0.061540 | 0.066120 | 0.09744 |
| radius_se | 569.0 | 0.405172 | 0.277313 | 0.111500 | 0.232400 | 0.324200 | 0.478900 | 2.87300 |
| texture_se | 569.0 | 1.216853 | 0.551648 | 0.360200 | 0.833900 | 1.108000 | 1.474000 | 4.88500 |
| perimeter_se | 569.0 | 2.866059 | 2.021855 | 0.757000 | 1.606000 | 2.287000 | 3.357000 | 21.98000 |
| area_se | 569.0 | 40.337079 | 45.491006 | 6.802000 | 17.850000 | 24.530000 | 45.190000 | 542.20000 |
| smoothness_se | 569.0 | 0.007041 | 0.003003 | 0.001713 | 0.005169 | 0.006380 | 0.008146 | 0.03113 |
| compactness_se | 569.0 | 0.025478 | 0.017908 | 0.002252 | 0.013080 | 0.020450 | 0.032450 | 0.13540 |
| concavity_se | 569.0 | 0.031894 | 0.030186 | 0.000000 | 0.015090 | 0.025890 | 0.042050 | 0.39600 |
| concave points_se | 569.0 | 0.011796 | 0.006170 | 0.000000 | 0.007638 | 0.010930 | 0.014710 | 0.05279 |
| symmetry_se | 569.0 | 0.020542 | 0.008266 | 0.007882 | 0.015160 | 0.018730 | 0.023480 | 0.07895 |
| fractal_dimension_se | 569.0 | 0.003795 | 0.002646 | 0.000895 | 0.002248 | 0.003187 | 0.004558 | 0.02984 |
| radius_worst | 569.0 | 16.269190 | 4.833242 | 7.930000 | 13.010000 | 14.970000 | 18.790000 | 36.04000 |
| texture_worst | 569.0 | 25.677223 | 6.146258 | 12.020000 | 21.080000 | 25.410000 | 29.720000 | 49.54000 |
| perimeter_worst | 569.0 | 107.261213 | 33.602542 | 50.410000 | 84.110000 | 97.660000 | 125.400000 | 251.20000 |
| area_worst | 569.0 | 880.583128 | 569.356993 | 185.200000 | 515.300000 | 686.500000 | 1084.000000 | 4254.00000 |
| smoothness_worst | 569.0 | 0.132369 | 0.022832 | 0.071170 | 0.116600 | 0.131300 | 0.146000 | 0.22260 |
| compactness_worst | 569.0 | 0.254265 | 0.157336 | 0.027290 | 0.147200 | 0.211900 | 0.339100 | 1.05800 |
| concavity_worst | 569.0 | 0.272188 | 0.208624 | 0.000000 | 0.114500 | 0.226700 | 0.382900 | 1.25200 |
| concave points_worst | 569.0 | 0.114606 | 0.065732 | 0.000000 | 0.064930 | 0.099930 | 0.161400 | 0.29100 |
| symmetry_worst | 569.0 | 0.290076 | 0.061867 | 0.156500 | 0.250400 | 0.282200 | 0.317900 | 0.66380 |
| fractal_dimension_worst | 569.0 | 0.083946 | 0.018061 | 0.055040 | 0.071460 | 0.080040 | 0.092080 | 0.20750 |
A pairplot will be used to explore the relationships between features and their interaction with the target variable. This visualization helps to examine pairwise relationships and distributions of features within the dataset. The focus will be on a subset of features to reveal correlations and patterns with the target variable. By color-coding the points based on the target variable, any distinct patterns or separations between different classes can be assessed. This analysis aims to provide insights into potential correlations and patterns that may influence feature selection and model building processes.
import seaborn as sns
import matplotlib.pyplot as plt
target = df.columns[0]
features_mean = df.columns[1:11]
sns.pairplot(df, vars=features_mean, hue=target)
plt.show()
features_se = df.columns[11:21]
sns.pairplot(df, vars=features_se, hue=target)
plt.show()
features_worst = df.columns[21:31]
sns.pairplot(df, vars=features_worst, hue=target)
plt.show()
The pairplots above offer a preliminary overview of the dataset's structure and the relationships among its features. Each off-diagonal panel is a scatterplot of one feature against another, while the diagonal panels show each feature's distribution, revealing potential skewness or outliers. The plots suggest a complex interplay between variables: some pairs (such as radius, perimeter, and area) exhibit strong linear correlations, while others are more dispersed. The color separation visible in many scatterplots hints at class separability, a valuable signal for the classification task.
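As a quick numeric complement to the visual impression of skewness on the pairplot diagonals, `DataFrame.skew()` can rank the features. The sketch below runs on scikit-learn's bundled copy of the dataset so it is self-contained; note its column names differ from the CSV (for example 'area error' rather than 'area_se'):

```python
# Rank features by skewness; large positive values mean a long right tail,
# matching what the pairplot diagonals suggest for the *_se features.
from sklearn.datasets import load_breast_cancer

X_frame = load_breast_cancer(as_frame=True).frame.drop(columns='target')
skews = X_frame.skew().sort_values(ascending=False)
print(skews.head())
```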
To understand the distribution of the target variable within the dataset, a pie chart will be created. This visualization will illustrate the proportion of each class within the target variable, providing a clear view of the balance or imbalance between different classes. By examining the pie chart, it will be possible to assess the relative frequency of each diagnosis category, which can inform subsequent analysis and modeling decisions.
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
import matplotlib.pyplot as plt
diagnosis_counts = df['diagnosis'].value_counts()
plt.figure(figsize=(8, 4))
# value_counts() lists the majority class first: benign (0), then malignant (1)
plt.pie(diagnosis_counts, labels=['Benign (0)', 'Malignant (1)'], autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title('Distribution of Diagnosis')
plt.show()
The pie chart reveals that class 1 (malignant) constitutes 37.3% of the dataset, while class 0 (benign) makes up the remaining 62.7%. This distribution indicates an imbalance, with a higher proportion of benign cases compared to malignant ones. Such an imbalance may affect the performance of classification models, potentially leading to bias towards the majority class.
Due to the dataset's imbalance, with 37.3% of instances being malignant and 62.7% benign, an undersampling technique will be used to balance the class distribution. This method reduces samples from the majority class (benign) to mitigate bias and improve the model's ability to learn from both classes equally, resulting in a more balanced and fair classification model.
from imblearn.under_sampling import RandomUnderSampler
undersampler = RandomUnderSampler(sampling_strategy = 'auto', random_state = 42)
X, y = undersampler.fit_resample(df.drop('diagnosis', axis = 1), df['diagnosis'])
df = pd.DataFrame(X, columns = df.columns[1:])
df['diagnosis'] = y
plt.figure(figsize=(8, 4))
resampled_counts = y.value_counts()
plt.pie(resampled_counts, labels=resampled_counts.index, autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title('Distribution of Diagnosis')
plt.show()
Feature selection will be conducted by analyzing the correlation between each feature and the target variable. Features with strong positive or negative correlations are more likely to be informative for the model, while weakly correlated features may be less useful. This approach helps in selecting the most predictive features, improving model performance and reducing dataset dimensionality.
corr_with_target = X.corrwith(y)
selected_features = corr_with_target[abs(corr_with_target) > 0.5].index
print(selected_features)
Index(['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean',
'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se',
'area_se', 'radius_worst', 'perimeter_worst', 'area_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst'],
dtype='object')
Based on the correlation analysis, 15 out of 30 features were selected for their strong relationship with the target variable. This refined set of features is expected to improve the model’s performance by focusing on the most relevant inputs, reducing dimensionality, and minimizing the risk of overfitting.
X = df[['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean',
'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se',
'area_se', 'radius_worst', 'perimeter_worst', 'area_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst']]
y = y.to_numpy()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
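One caveat with this step: fitting the scaler on all of X before cross-validation lets statistics from future test folds influence preprocessing. A common alternative, sketched below on scikit-learn's bundled copy of the dataset as stand-in input, is to wrap the scaler and classifier in a Pipeline so the scaler is refit on each training fold only:

```python
# Sketch: keep scaling inside a Pipeline so that, during cross-validation,
# StandardScaler is fit only on each training fold and test-fold statistics
# cannot leak into preprocessing.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)  # stand-in data
pipe = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean())
```

In practice the leakage from scaling alone is usually small, but the Pipeline form removes it entirely and generalizes to any preprocessing step.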
In this project, six different classification methods are employed to build and evaluate the predictive model: Logistic Regression, Decision Tree Classifier, Random Forest, Support Vector Machine (SVM), Gaussian Naive Bayes (NB), and K-Nearest Neighbors (KNN). Each method is assessed using K-Fold Cross-Validation to ensure robust performance evaluation and mitigate potential overfitting. The primary metrics used for evaluation are Accuracy and Area Under the Curve (AUC), which provide insights into the model's overall performance and its ability to distinguish between classes. This comprehensive approach allows for a thorough comparison of various classification techniques to determine the most effective model for the given dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, accuracy_score
from sklearn.model_selection import KFold
from sklearn.base import clone
def classification_model(model, X, y, k=5):
    # Evaluate `model` with k-fold cross-validation, plotting the ROC curve
    # of each fold and reporting per-fold accuracy and AUC.
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    accuracies = []
    aucs = []
    plt.figure(figsize=(8, 4))
    mean_fpr = np.linspace(0, 1, 100)
    tprs = []
    for i, (train_index, test_index) in enumerate(kf.split(X)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clf = clone(model)  # fresh, unfitted copy of the estimator per fold
        clf.fit(X_train, y_train)
        y_prob = clf.predict_proba(X_test)[:, 1]
        y_pred = clf.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        roc_auc = auc(fpr, tpr)
        accuracies.append(accuracy)
        aucs.append(roc_auc)
        # Interpolate onto a common FPR grid so the fold curves can be averaged
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        plt.plot(fpr, tpr, lw=2, alpha=0.6, label=f'ROC fold {i+1} (AUC = {roc_auc:.2f})')
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, color='blue', lw=2, label=f'Mean ROC (AUC = {mean_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--', label='Random Classifier')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc='lower right')
    plt.show()
    mean_accuracy = np.mean(accuracies)
    print(f'Mean Accuracy: {mean_accuracy:.4f}\n')
    print(f'Mean AUC: {mean_auc:.4f}\n')
    for i in range(k):
        print(f'Fold{i+1}-Accuracy: {accuracies[i]:.4f}, AUC: {aucs[i]:.4f}')
    return accuracies, aucs
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9339

Mean AUC: 0.9846

Fold1-Accuracy: 0.9647, AUC: 0.9983
Fold2-Accuracy: 0.8941, AUC: 0.9770
Fold3-Accuracy: 0.9529, AUC: 0.9877
Fold4-Accuracy: 0.9529, AUC: 0.9961
Fold5-Accuracy: 0.9048, AUC: 0.9852
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9294

Mean AUC: 0.9300

Fold1-Accuracy: 0.9294, AUC: 0.9328
Fold2-Accuracy: 0.8824, AUC: 0.8785
Fold3-Accuracy: 0.9412, AUC: 0.9443
Fold4-Accuracy: 0.9176, AUC: 0.9196
Fold5-Accuracy: 0.9762, AUC: 0.9761
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9411

Mean AUC: 0.9838

Fold1-Accuracy: 0.9412, AUC: 0.9967
Fold2-Accuracy: 0.8824, AUC: 0.9611
Fold3-Accuracy: 0.9529, AUC: 0.9924
Fold4-Accuracy: 0.9647, AUC: 0.9970
Fold5-Accuracy: 0.9643, AUC: 0.9943
from sklearn.svm import SVC
model = SVC(probability = True)
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9363

Mean AUC: 0.9837

Fold1-Accuracy: 0.9529, AUC: 0.9972
Fold2-Accuracy: 0.9059, AUC: 0.9714
Fold3-Accuracy: 0.9412, AUC: 0.9922
Fold4-Accuracy: 0.9412, AUC: 0.9900
Fold5-Accuracy: 0.9405, AUC: 0.9898
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9174

Mean AUC: 0.9751

Fold1-Accuracy: 0.9412, AUC: 0.9872
Fold2-Accuracy: 0.9059, AUC: 0.9653
Fold3-Accuracy: 0.9059, AUC: 0.9625
Fold4-Accuracy: 0.9294, AUC: 0.9917
Fold5-Accuracy: 0.9048, AUC: 0.9858
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
accuracies, aucs = classification_model(model, X, y, k = 5)
Mean Accuracy: 0.9316

Mean AUC: 0.9700

Fold1-Accuracy: 0.9647, AUC: 0.9961
Fold2-Accuracy: 0.9176, AUC: 0.9474
Fold3-Accuracy: 0.9412, AUC: 0.9838
Fold4-Accuracy: 0.9176, AUC: 0.9642
Fold5-Accuracy: 0.9167, AUC: 0.9744
The model evaluation results are summarized in the table below, showing the mean accuracy and mean AUC values for each classification method used. This summary provides an overall assessment of the model's performance, highlighting how well each method performs on average across the different folds of cross-validation.
| Classification Model | Mean Accuracy | Mean AUC |
|---|---|---|
| Logistic Regression | 0.9339 | 0.9846 |
| Decision Tree | 0.9294 | 0.9300 |
| Random Forest | 0.9411 | 0.9838 |
| Support Vector Machine | 0.9363 | 0.9837 |
| Gaussian Naive Bayes | 0.9174 | 0.9751 |
| K-Nearest Neighbors | 0.9316 | 0.9700 |
The results indicate that the Random Forest classifier achieved the highest mean accuracy, making it the most effective method here at correctly labelling both malignant and benign cases. Logistic Regression, meanwhile, recorded the highest mean AUC, indicating the strongest ability to rank malignant cases above benign ones across all classification thresholds. Note that Decision Tree and Random Forest results vary slightly between runs unless a fixed random_state is supplied.
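The fold-by-fold loop above can also be expressed compactly with scikit-learn's cross_validate, scoring accuracy and ROC AUC together. The sketch below runs on scikit-learn's bundled copy of the dataset with no undersampling or feature selection, so its numbers will not match the table above:

```python
# Compact cross-validated comparison of the same six classifier families,
# scoring accuracy and ROC AUC in one call per model. Scaling sits inside
# a Pipeline so it is refit on each training fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)  # stand-in data
models = {
    'Logistic Regression': LogisticRegression(max_iter=5000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'Gaussian NB': GaussianNB(),
    'KNN': KNeighborsClassifier(),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    cv_res = cross_validate(pipe, X_demo, y_demo, cv=cv,
                            scoring=['accuracy', 'roc_auc'])
    results[name] = (cv_res['test_accuracy'].mean(),
                     cv_res['test_roc_auc'].mean())
    print(f'{name}: acc={results[name][0]:.4f}, auc={results[name][1]:.4f}')
```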
!jupyter nbconvert --to html /content/Evaluating_Machine_Learning_Models_for_Breast_Cancer_Classification.ipynb
[NbConvertApp] Converting notebook /content/Evaluating_Machine_Learning_Models_for_Breast_Cancer_Classification.ipynb to html [NbConvertApp] Writing 9350995 bytes to /content/Evaluating_Machine_Learning_Models_for_Breast_Cancer_Classification.html